In recent years, high-throughput screening has been used in the exploratory phase of the drug discovery process. At this stage, thousands of small-molecule compounds are tested for inhibitory capability against a biological target in order to find a lead molecule. Any molecule found to inhibit the biological function is then optimized further to make it more potent. At this stage, a structure-activity relationship (SAR) can also be developed, in which molecules sharing the same scaffold are analyzed for degree of effect based on their physical characteristics.
The objective of this analysis is to reveal hidden patterns, trends, or relationships between the intrinsic properties of a molecule and its effectiveness in inhibiting the ATPase activity of the helicase protein. This project provided me with an opportunity to explore my experimental data using a proper data science methodology of data wrangling, data analysis, data visualization, prediction, and data storytelling. Please also note that there is proprietary software one can use to run a PCA analysis on screening data.
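To make that point concrete, here is a minimal open-source sketch of how such a PCA could be run with scikit-learn instead, assuming the descriptor columns that appear later in this notebook (LogP, LogD, HBondDonors, HBondAcceptors, RotatableBonds); it is an illustration only, not the workflow used for this project.

# Hypothetical sketch: PCA on the molecular descriptors with scikit-learn,
# as an open-source alternative to the proprietary tools mentioned above.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Descriptor columns assumed to be present in the screening data described below.
descriptor_cols = ['LogP', 'LogD', 'HBondDonors', 'HBondAcceptors', 'RotatableBonds']
df = pd.read_excel('data/DENV_HTS_Data.xlsx')

# Standardize the descriptors so that no single one dominates the components.
scaled = StandardScaler().fit_transform(df[descriptor_cols])

# Project onto the first two principal components and report the variance explained.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)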
The initial data was exported as an Excel spreadsheet by the microplate reader used for screening. I then merged it with the molecule file from the vendor, which contains molecular information such as SMILES (a 1D chemical representation), and imported the result into the Instant JChem software. This software calculated additional molecular descriptors from the chemical structure, such as LogP, LogD, the number of hydrogen bond donors, the number of hydrogen bond acceptors, the number of rotatable bonds, and whether the molecule satisfies Lipinski's Rule of 5.
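As a side note, similar descriptors can also be computed directly from the SMILES strings with the open-source RDKit toolkit. The sketch below is only illustrative: the values will not match Instant JChem exactly, and RDKit does not provide a LogD calculator.

# Illustrative alternative: compute descriptors from SMILES with RDKit.
# Values may differ slightly from those produced by Instant JChem.
from rdkit import Chem
from rdkit.Chem import Descriptors

def rdkit_descriptors(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES string
        return None
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)
    rot = Descriptors.NumRotatableBonds(mol)
    mw = Descriptors.MolWt(mol)
    # Lipinski's Rule of 5: MW <= 500, LogP <= 5, HBD <= 5, HBA <= 10
    rule_of_5 = (mw <= 500) and (logp <= 5) and (hbd <= 5) and (hba <= 10)
    return {'LogP': logp, 'HBondDonors': hbd, 'HBondAcceptors': hba,
            'RotatableBonds': rot, 'LipinskiRuleof5': rule_of_5}

print(rdkit_descriptors('CCO'))  # ethanol, as a quick sanity check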
The dataset used for this project is the result of combining the assay results of the different compound libraries used for a Dengue Virus and Hepatitis C Virus high-throughput screening. It is currently in CSV format and has the following layout:
Initially, I wanted to create some plots that would answer the following questions: 1) Since the dataset contains two columns describing the %Inhibition against Hepatitis C and Dengue Virus, how do the compounds compare when those two columns are plotted side by side, and are some compounds more potent as HCV inhibitors than as DENV inhibitors (or vice versa)? 2) Does the number of rotatable bonds, the number of hydrogen bond donors, the number of hydrogen bond acceptors, or the lipophilicity index affect the degree of inhibition of the compound against the Dengue virus? 3) What is the correlation between the intrinsic properties of a compound and its ability to inhibit the virus's helicase activity?
Some basic plots are shown below that illustrate the trends and correlations between molecular descriptors and inhibitory capability. To conclude, we learn that 1) most of the compounds in this assay showed poor inhibition against the helicase protein of both Hepatitis C and Dengue Virus, and 2) a lower number of hydrogen bond donors correlates with a higher inhibition rate against the DENV helicase, but the number of rotatable bonds does not.
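Before the plots, a quick numeric sanity check of conclusion 2 can be done with a simple correlation; this is a sketch reusing the same data file and the column names that appear in the cells below.

# Quick numeric check of conclusion 2: correlation of selected descriptors with DENV inhibition.
import pandas as pd

dataHTS = pd.read_excel('data/DENV_HTS_Data.xlsx')  # same file loaded in the first cell below
print(dataHTS[['HBondDonors', 'RotatableBonds', 'DENVATPaseInh']].corr()['DENVATPaseInh'])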
In [1]:
# Import all the libraries
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Set-up matplotlib style
plt.style.use('fivethirtyeight')
# Load data
dataHTS = pd.read_excel('data/DENV_HTS_Data.xlsx')
dataHTS.head(10)
Out[1]:
In [2]:
# First, I'd like to see a representation of the distribution of values.
# Does the number of rotatable bonds influence the percent inhibition
# exhibited by the molecule?
sns.set(style="whitegrid", palette="bright", color_codes=True)
# "Melt" the dataset to "long-form" or "tidy" representation
DENV = pd.melt(dataHTS, "RotatableBonds", var_name="LipinskiRuleof5")
# Draw a categorical scatterplot to show each observation
sns.swarmplot(x="RotatableBonds", y="DENVATPaseInh", hue="LipinskiRuleof5", data=dataHTS)
Out[2]:
In [3]:
# Next, I'd like to find out whether a molecule's number of hydrogen bond donors
# affects the molecule's ability to pass the Lipinski Rule of 5 and also
# influences its degree of inhibition of the helicase activity in this dengue assay.
sns.set(style="darkgrid")
g = sns.FacetGrid(dataHTS, row="LipinskiRuleof5", col="HBondDonors", margin_titles=True)
bins = np.linspace(0, 60, 13)
g.map(plt.hist, "DENVATPaseInh", color="steelblue", bins=bins, lw=0)
Out[3]:
In [4]:
# Next, create a jointplot to determine whether the small molecule/compound
# is effective in inhibiting the helicase protein's activity of either
# Hepatitis C or Dengue viruses.
sns.set(style="darkgrid", color_codes=True)
g = sns.jointplot("DENVATPaseInh", "HCVHDAInh", data=dataHTS, kind="reg",
xlim=(0, 90), ylim=(0, 90), color="r", size=7)
In [5]:
# Now, let's see whether a box-and-whisker plot gives a better summary of the
# distribution of inhibition values. First, create a new dataframe with just the
# inhibition data for both HCV and Dengue.
dataHTS_boxplot = dataHTS[['HCVHDAInh','DENVATPaseInh']].copy()
# Set the width and height of the box plot.
plt.figure(figsize=(10,6))
# Limit the range of the y-axis. This is helpful because we have a lot of outliers.
plt.ylim(0,50)
plt.title('Box Plots Comparing Compound Inhibition of HCV and Dengue Viruses')
sns.boxplot(data=dataHTS_boxplot)
Out[5]:
In [6]:
# Now show a violin plot comparing the assay results of the compounds against
# HCV and dengue virus helicase. Same data as above is used.
plt.figure(figsize=(10,6))
plt.ylim(0,60)
plt.title('Violin Plots Comparing Compound Inhibition Targeting Helicase Activity of HCV and Dengue Viruses')
sns.violinplot(data=dataHTS_boxplot)
Out[6]:
In [7]:
# But first, let's find out the degree of correlation between intrinsic variables. Correlation gives
# an indication of how related the changes are between two variables. If two variables change
# in the same direction then they are positively correlated. If they change in opposite directions together
# then they are negatively correlated. The correlation of independent variables is useful to know since some
# machine learning algorithms such as linear and logistic regression can have poor performance if there are
# highly correlated input variables in the data.
# But before running the correlation function, I'm going to drop some columns that I believe will not help
# predict another molecule's inhibitory capability. These variables only describe some external characteristic
# and not an intrinsic property of the molecule. These are compound ID, percent purity, molecular formula,
# lot number, daughter plate, plate map, IUPAC name, and SMILES.
# Drop some variables because they describe external characteristics not intrinsic properties.
# Also, at this point I just want to concentrate on Dengue virus inhibition so I'll also drop
# the column about the HCV assay result.
dataHTSmodel = dataHTS.drop(['CdId','PercentPurity','IUPACName','MolFormula','LotNumber','DaughterPlate','PlateMap','HCVHDAInh','SMILES'], axis=1)
dataHTSmodel.head(10)
dataHTSmodelcorr = dataHTSmodel.corr()
# Let's take a look at the correlation table
dataHTSmodelcorr
Out[7]:
In [8]:
# An easier way to visualize these correlations is with a heatmap (matrix plot).
# Generate a mask for the upper triangle.
mask = np.zeros_like(dataHTSmodelcorr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(dataHTSmodelcorr, mask=mask, cmap=cmap, vmax=.3,
square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[8]:
In [9]:
# LINEAR REGRESSION MODEL
# Now, I'd like to run a machine learning model against this data. I'd start with
# Linear Regression since the experimental result can be used as the dependent variable.
# Y = Dengue Virus ATPase Percent Inhibition (called "target" data in python, and referred
# to as the dependent variable or response variable)
# X = all the other features (or independent variables, predictors or explanatory variables)
# which we will use to fit
# Import regression modules
import statsmodels.api as sm
from statsmodels.formula.api import ols
# This code uses the R-style formula interface from statsmodels. It is just a preliminary
# run, since I haven't split the dataset into training and testing sets yet.
m = ols('DENVATPaseInh ~ LogP + LogD + HBondDonors + HBondAcceptors + RotatableBonds + LipinskiRuleof5', dataHTS).fit()
print(m.summary())
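For quicker inspection than the full summary table, the fitted statsmodels results object also exposes the individual pieces of the fit; a small sketch:

# Pull individual pieces of the fit out of the statsmodels results object.
print(m.params)     # estimated coefficient for each descriptor
print(m.pvalues)    # p-values of the coefficient estimates
print(m.rsquared)   # R-squared of this preliminary fit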
In [10]:
# Now, I'd like to run the Python-based modules.
# Before training the model, make sure we can test it with unseen data to validate the
# model for generalizability. There are different ways of doing train-test splits.
# Let's use a random 70:30 split for training:testing.
# At this point, even though the correlation plot showed some positive correlation between
# variables, I won't drop any of them because I don't think the correlations are strong enough to warrant it.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Let's define our independent and dependent variables for regression
X = dataHTSmodel.iloc[:,3:9].values
Y = dataHTSmodel['DENVATPaseInh'].values
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=5)
# Display the number of rows and columns in our split data sets.
print(X_train.shape)
print (X_test.shape)
print (Y_train.shape)
print (Y_test.shape)
In [11]:
# Create an instance of the linear regression model.
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
lm = LinearRegression()
# Now, train the model with all the 6 predictors using the training set
lm.fit(X_train, Y_train)
# First, check to see how much of the variance in the training data is captured by the model.
# We can compute R-squared for this.
print ("R-squared: ")
print (r2_score(Y_train, lm.predict(X_train)))
print()
# Predict the output on the test set and print the first five values
print (lm.predict(X_test)[0:5])
print()
# Look at the mean squared error on the test set first and then on the training set.
print ("Mean squared error on the test set: ")
print (mean_squared_error(Y_test, lm.predict(X_test)))
print ("Mean squared error on the training set: ")
print (mean_squared_error(Y_train, lm.predict(X_train)))
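It is also worth computing R-squared on the held-out test set rather than only on the training data; a short follow-up sketch using the metrics already imported above:

# R-squared on the unseen test set, for comparison with the training-set value above.
print("R-squared on the test set: ")
print(r2_score(Y_test, lm.predict(X_test)))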
In [13]:
# Let's check for homogeneity of variance (homoscedasticity) of the residuals.
# Plot the residuals vs. fitted values.
import matplotlib.patches as mpatches
plt.figure(figsize=(10,6))
plt.scatter(lm.predict(X_train), lm.predict(X_train) - Y_train, c='b', s=40, alpha=0.5)
plt.scatter(lm.predict(X_test), lm.predict(X_test) - Y_test, c='g', s=40)
plt.hlines(y = 0, xmin=-30, xmax = 50)
plt.title('Residuals vs Fitted Values')
plt.ylabel('Residuals')
plt.xlabel('Fitted Values')
blue_patch = mpatches.Patch(color='b', label='Train')
green_patch = mpatches.Patch(color ='g', label='Test')
plt.legend(handles=[blue_patch, green_patch])
Out[13]:
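Eyeballing the residual plot can be complemented by a formal test; below is a sketch of a Breusch-Pagan test from statsmodels on the training residuals. This is an illustrative addition of mine, not part of the original analysis.

# Breusch-Pagan test for heteroscedasticity of the training residuals (illustrative).
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

residuals = Y_train - lm.predict(X_train)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, sm.add_constant(X_train))
print("Breusch-Pagan p-value:", lm_pvalue)  # a small p-value suggests heteroscedasticity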
In [14]:
# RANDOM FOREST MODEL
# Create an instance of the Random Forest model.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state = 1)
# Now, train the model with all 6 predictors in the training set
rf.fit(X_train, Y_train)
# Check to see how much of the variance in the training data is captured by the model.
# Compute R-squared for this (also referred to as the coefficient of determination).
print ("R-squared for training data: ")
print (rf.score(X_train,Y_train))
print()
print ("R-squared for test data: ")
print (rf.score(X_test,Y_test))
print()
# Predict the output on the test set and print the first five values
print ("Percent inhibition for succeeding molecules in the test set:")
print (rf.predict(X_test)[0:5])
print()
# Look at the mean squared errors
from sklearn.metrics import mean_squared_error
print ("Mean squared error on the training set: ")
print (mean_squared_error(Y_train, rf.predict(X_train)))
print()
print ("Mean squared error on the test set: ")
print (mean_squared_error(Y_test, rf.predict(X_test)))
print()
The improvement is also quite clear from the MSE value on the training set: 19.4 with the random forest. However, the MSE is much higher on the test set, which hints at some overfitting to the training data. Let's do some cross-validation so we can get a more honest error estimate and mitigate overfitting, if any.
In [15]:
# Let's do some cross-validation
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=1)
# Import KFold from sklearn (model_selection replaces the old cross_validation module)
from sklearn.model_selection import KFold
# Split the dataset into 5 folds
kf = KFold(n_splits=5)
# Now use the iterator to get the indices of all 5 folds
# Initialize the running total of the test-fold MSE
mse_test = 0
for train_indices, test_indices in kf.split(X):
    # Fit a random forest model with all 6 predictors to the training fold
    rf.fit(X[train_indices], Y[train_indices])
    results = rf.predict(X[test_indices])
    # Find the mean squared error of each test fold and add it to the running total
    mse_test = mse_test + mean_squared_error(Y[test_indices], results)
print("Average mean squared error across all 5 folds is: ")
print(mse_test / 5)
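A random forest also reports how much each predictor contributed to the fit, so it is worth peeking at the feature importances. This is a small sketch that assumes the six predictors in X are the columns selected from dataHTSmodel above:

# Refit on the training split and inspect which descriptors the forest relied on most.
rf.fit(X_train, Y_train)
feature_names = dataHTSmodel.columns[3:9]  # same column slice used to build X above
for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(name, round(importance, 3))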
In [19]:
# Now, I'd like to conduct a cluster analysis of the screening data to find out whether
# there are distinct groups into which the compounds aggregate based on their characteristics.
# By performing a silhouette analysis, we can study the separation distance between the resulting clusters.
from sklearn.cluster import KMeans
import numpy as np
dataHTS_truncated = dataHTS.drop(dataHTS.columns[[0,1,2,3,4,5,6,9,10]], axis=1)
dataHTS_cols = dataHTS_truncated.values
dataHTS_truncated.head(10)
Out[19]:
In [17]:
# K-MEANS CLUSTERING - SILHOUETTE ANALYSIS
# Using the silhouette_score function from scikit-learn,
# construct a series of silhouette plots. By doing silhouette
# analysis, we can study the separation distance between the
# resulting clusters. The dataset I'll be using here is dataHTS_cols.
from __future__ import print_function
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
import numpy as np
range_n_clusters = range(2, 10)
for n_clusters in range_n_clusters:
# Create a subplot with 1 row and 2 columns.
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
# The 1st subplot is the silhouette plot.
# The silhouette coefficient can range from -1 to 1; here the x-axis is
# limited to [-0.1, 1.0].
ax1.set_xlim([-0.1, 1.0])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
# ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(dataHTS_cols)
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(dataHTS_cols, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(dataHTS_cols, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters in screening dataset.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([])
# The 2nd plot shows the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(dataHTS_cols[:, 0], dataHTS_cols[:, 1], marker='.', s=30, lw=0, alpha=0.7,
c=colors)
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1],
marker='o', c="white", alpha=1, s=200)
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)
ax2.set_title("The visualization of the clustered screening data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
plt.suptitle(("Silhouette analysis for KMeans clustering on Dengue HTS Data "
"with n_clusters = %d" % n_clusters),
fontsize=14, fontweight='bold')
plt.show()
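Finally, to make any chosen clustering interpretable in chemical terms, one simple follow-up is to average the descriptors within each cluster. The sketch below uses n_clusters=3 purely as an arbitrary illustration; the silhouette scores printed above should guide the actual choice.

# Sketch: summarize clusters chemically by averaging each descriptor per cluster.
# n_clusters=3 is an arbitrary illustrative choice, not a conclusion from the analysis.
km = KMeans(n_clusters=3, random_state=10)
labels = km.fit_predict(dataHTS_cols)
cluster_summary = dataHTS_truncated.assign(cluster=labels).groupby('cluster').mean()
print(cluster_summary)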